Ethical Data Science
What is personal data?
Personal data should not be collected, analysed or distributed without consent.
No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks. - UN General Assembly, 1948
‘Consent’ of the data subject means any freely given, specific, informed and unambiguous indication of the data subject’s wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her; - GDPR Article 4
Pseudonymisation: processing data so that it does not relate to an identifiable person.
Re-identification: relating a pseudonymised data entry to an identifiable person.
Anonymisation: A pseudonymisation method that precludes re-identification.
Quasi-identifiers: Attributes that can also be observed in public data. For example, someone’s name, job title, zip code, or email.
For the set of quasi-identifiers \(A_1, \ldots, A_p\), a table is \(k\)-anonymous if each possible value assignment to these variables \((a_1, \ldots, a_p)\) is observed for either 0 or at least \(k\) individuals.
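
As a rough illustration of the definition above, the following Python sketch counts how often each combination of quasi-identifier values occurs and checks that every observed combination appears at least \(k\) times. The function name `is_k_anonymous`, the column names and the toy records are assumptions made for this example only.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every observed combination of quasi-identifier
    values is shared by at least k rows.

    Combinations that never occur have a count of 0, which the
    definition also allows, so only observed combinations are checked.
    """
    counts = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(n >= k for n in counts.values())


# Toy records with hypothetical values, for illustration only.
rows = [
    {"post_code": "OX1", "age": "<20", "condition": "A"},
    {"post_code": "OX1", "age": "<20", "condition": "B"},
    {"post_code": "OX2", "age": ">=30", "condition": "A"},
    {"post_code": "OX2", "age": ">=30", "condition": "C"},
]

print(is_k_anonymous(rows, ["post_code", "age"], k=2))  # True
print(is_k_anonymous(rows, ["post_code", "age"], k=3))  # False
```

Whether a table counts as \(k\)-anonymous depends on which columns are treated as quasi-identifiers, so that choice is passed in explicitly rather than fixed inside the function.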
| | Post Code | Age | Drug Use | Condition |
|---|---|---|---|---|
| 1 | OX1**** | <20 | * | Herpes |
| 2 | OX1**** | <20 | * | Herpes |
| 3 | OX2**** | >=30 | * | Chlamydia |
| 4 | OX2**** | >=30 | * | Herpes |
| 5 | OX1**** | <20 | * | Gonorrhea |
| 6 | OX2**** | >=30 | * | Gonorrhea |
| 7 | OX1**** | <20 | * | Gonorrhea |
| 8 | LA1**** | 2* | * | Chlamydia |
| 9 | LA1**** | 2* | * | Chlamydia |
| 10 | OX2**** | >=30 | * | Gonorrhea |
| 11 | LA1**** | 2* | * | Chlamydia |
| 12 | LA1**** | 2* | * | Chlamydia |
| | Post Code | Age | Drug Use | Condition | Equivalence Class |
|---|---|---|---|---|---|
| 1 | OX1**** | <20 | * | Herpes | 1 |
| 2 | OX1**** | <20 | * | Herpes | 1 |
| 3 | OX2**** | >=30 | * | Chlamydia | 2 |
| 4 | OX2**** | >=30 | * | Herpes | 2 |
| 5 | OX1**** | <20 | * | Gonorrhea | 1 |
| 6 | OX2**** | >=30 | * | Gonorrhea | 2 |
| 7 | OX1**** | <20 | * | Gonorrhea | 1 |
| 8 | LA1**** | 2* | * | Chlamydia | 3 |
| 9 | LA1**** | 2* | * | Chlamydia | 3 |
| 10 | OX2**** | >=30 | * | Gonorrhea | 2 |
| 11 | LA1**** | 2* | * | Chlamydia | 3 |
| 12 | LA1**** | 2* | * | Chlamydia | 3 |
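
The equivalence classes in the table above can be recovered by grouping rows on their quasi-identifier values. The sketch below assumes that Post Code, Age and Drug Use are the quasi-identifiers and that Condition is the private attribute; the function name `equivalence_classes` is hypothetical.

```python
from collections import defaultdict

# Quasi-identifier values (post code, age, drug use) for rows 1 to 12
# of the table above, keyed by row number.
records = {
    1: ("OX1****", "<20", "*"),
    2: ("OX1****", "<20", "*"),
    3: ("OX2****", ">=30", "*"),
    4: ("OX2****", ">=30", "*"),
    5: ("OX1****", "<20", "*"),
    6: ("OX2****", ">=30", "*"),
    7: ("OX1****", "<20", "*"),
    8: ("LA1****", "2*", "*"),
    9: ("LA1****", "2*", "*"),
    10: ("OX2****", ">=30", "*"),
    11: ("LA1****", "2*", "*"),
    12: ("LA1****", "2*", "*"),
}


def equivalence_classes(records):
    """Group row numbers by quasi-identifier tuple.

    Classes are labelled 1, 2, 3, ... in order of first appearance.
    """
    groups = defaultdict(list)
    for row_id, key in records.items():
        groups[key].append(row_id)
    return {label: ids for label, ids in enumerate(groups.values(), start=1)}


classes = equivalence_classes(records)
print(classes)  # {1: [1, 2, 5, 7], 2: [3, 4, 6, 10], 3: [8, 9, 11, 12]}

# The largest k for which the table is k-anonymous is the smallest class size.
k = min(len(ids) for ids in classes.values())
print(k)  # 4
```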
There are three main ways that you can improve the privacy within a dataset:
Redaction (of columns or rows)
Aggregation (Continuous -> discrete or combining discrete groups)
Corruption / Noise
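
A minimal sketch of the three approaches applied to a single record, using only the Python standard library. The record, its field names, the age bands and the noise scale are all hypothetical choices for illustration; in practice they would depend on the dataset and on the level of privacy required.

```python
import random

# Hypothetical raw record; field names and values are illustrative only.
record = {"name": "A. Person", "post_code": "OX1 3AB", "age": 19, "weight_kg": 70.0}


def redact(rec, columns):
    """Redaction: drop the named columns entirely."""
    return {key: value for key, value in rec.items() if key not in columns}


def aggregate_age(age):
    """Aggregation: map a numeric age onto coarse bands."""
    if age < 20:
        return "<20"
    if age < 30:
        return "2*"
    return ">=30"


def truncate_post_code(post_code, keep=3):
    """Aggregation: keep the first `keep` characters and mask the rest."""
    return post_code[:keep] + "****"


def add_noise(value, scale, rng):
    """Corruption: perturb a numeric value with zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, scale)


rng = random.Random(0)  # seeded so that the example is reproducible

private = redact(record, {"name"})
private["age"] = aggregate_age(private["age"])
private["post_code"] = truncate_post_code(private["post_code"])
private["weight_kg"] = add_noise(private["weight_kg"], scale=2.0, rng=rng)

print(private)
```

Each step trades some analytical value for privacy: redaction removes information outright, aggregation coarsens it, and noise keeps the column but makes individual values unreliable.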
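\(k\)-anonymity is vulnerable to a lack of diversity in private attributes within an equivalence class. Here every individual in the class has the same condition, so knowing that someone belongs to the class reveals their condition.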
| | Post Code | Age | Drug Use | Condition |
|---|---|---|---|---|
| 8 | LA1**** | 2* | * | Chlamydia |
| 9 | LA1**** | 2* | * | Chlamydia |
| 11 | LA1**** | 2* | * | Chlamydia |
| 12 | LA1**** | 2* | * | Chlamydia |
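
One way to detect this problem is to count the distinct private values within each equivalence class. The sketch below does this for the twelve-row table, again assuming that Post Code, Age and Drug Use are the quasi-identifiers and Condition is the private attribute; the function name `sensitive_values_by_class` is hypothetical.

```python
from collections import defaultdict

# (quasi-identifier tuple, condition) for rows 1 to 12 of the 12-row table.
rows = [
    (("OX1****", "<20", "*"), "Herpes"),
    (("OX1****", "<20", "*"), "Herpes"),
    (("OX2****", ">=30", "*"), "Chlamydia"),
    (("OX2****", ">=30", "*"), "Herpes"),
    (("OX1****", "<20", "*"), "Gonorrhea"),
    (("OX2****", ">=30", "*"), "Gonorrhea"),
    (("OX1****", "<20", "*"), "Gonorrhea"),
    (("LA1****", "2*", "*"), "Chlamydia"),
    (("LA1****", "2*", "*"), "Chlamydia"),
    (("OX2****", ">=30", "*"), "Gonorrhea"),
    (("LA1****", "2*", "*"), "Chlamydia"),
    (("LA1****", "2*", "*"), "Chlamydia"),
]


def sensitive_values_by_class(rows):
    """Map each quasi-identifier tuple to the set of private values seen in it."""
    seen = defaultdict(set)
    for quasi, sensitive in rows:
        seen[quasi].add(sensitive)
    return dict(seen)


diversity = sensitive_values_by_class(rows)

# Classes with a single private value disclose it for every member.
homogeneous = [quasi for quasi, values in diversity.items() if len(values) == 1]
print(homogeneous)  # [('LA1****', '2*', '*')]
```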
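Released data are also vulnerable to external data-linkage attacks, in which records are matched against another data source that shares some of the same attributes.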
| User ID | Film ID | Rating | Date |
|---|---|---|---|
| 000001 | 548782 | 5 | 2001-01-01 |
| 000001 | 549325 | 1 | 2001-01-01 |
| … | … | … | … |
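
To make the linkage idea concrete, the sketch below matches pseudonymised ratings against a second source that shares film, rating and date. The first two rating rows copy the table above; the third rating row, the `public_reviews` list and the name `reviewer_x` are invented for illustration and do not come from any real dataset.

```python
from collections import Counter

# Pseudonymised ratings: (user_id, film_id, rating, date).
ratings = [
    ("000001", "548782", 5, "2001-01-01"),
    ("000001", "549325", 1, "2001-01-01"),
    ("000002", "548782", 3, "2001-02-14"),  # hypothetical extra row
]

# Hypothetical public source: (public_name, film_id, rating, date).
public_reviews = [
    ("reviewer_x", "548782", 5, "2001-01-01"),
    ("reviewer_x", "549325", 1, "2001-01-01"),
]


def link(ratings, public_reviews):
    """Count, for each (public name, user id) pair, how many records
    agree on film, rating and date."""
    public_index = {}
    for name, film_id, rating, date in public_reviews:
        public_index.setdefault((film_id, rating, date), []).append(name)

    matches = Counter()
    for user_id, film_id, rating, date in ratings:
        for name in public_index.get((film_id, rating, date), []):
            matches[(name, user_id)] += 1
    return matches


matches = link(ratings, public_reviews)
print(matches.most_common(1))  # [(('reviewer_x', '000001'), 2)]
```

The more records a pair agrees on, the more likely it is that the pseudonymous user ID and the public name refer to the same person, even though the released table contains no names.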
Effective Data Science: Ethics - Privacy - Zak Varty